On the power of conditional independence testing under model-X
For testing conditional independence (CI) of a response Y and a predictor X
given covariates Z, the recently introduced model-X (MX) framework has been the
subject of active methodological research, especially in the context of MX
knockoffs and their successful application to genome-wide association studies.
In this paper, we study the power of MX CI tests, yielding quantitative
explanations for empirically observed phenomena and novel insights to guide the
design of MX methodology. We show that any valid MX CI test must also be valid
conditionally on Y and Z; this conditioning allows us to reformulate the
problem as testing a point null hypothesis involving the conditional
distribution of X. The Neyman-Pearson lemma then implies that the conditional
randomization test (CRT) based on a likelihood statistic is the most powerful
MX CI test against a point alternative. We also obtain a related optimality
result for MX knockoffs. Switching to an asymptotic framework with arbitrarily
growing covariate dimension, we derive an expression for the limiting power of
the CRT against local semiparametric alternatives in terms of the prediction
error of the machine learning algorithm on which its test statistic is based.
Finally, we exhibit a resampling-free test with uniform asymptotic Type-I error
control under the assumption that only the first two moments of X given Z are
known, a significant relaxation of the MX assumption.
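As a rough illustration of the conditional randomization test (CRT) described above, the following minimal sketch assumes that X given Z is Gaussian with a known mean function and standard deviation. The function name crt_pvalue, its parameters, the toy data, and the covariance-type test statistic are all hypothetical choices made for this example; in particular, this is not the likelihood statistic that the abstract identifies as most powerful.

```python
import numpy as np

def crt_pvalue(x, y, z, cond_mean, cond_sd, n_resamples=500, seed=0):
    """Conditional randomization test p-value (illustrative sketch).

    Assumes X | Z ~ Normal(cond_mean(z), cond_sd**2) is known exactly,
    which is the model-X assumption in its simplest form.
    """
    rng = np.random.default_rng(seed)
    mu = cond_mean(z)

    def stat(x_vals):
        # Absolute sample covariance between the centered X and Y.
        return abs(np.mean((x_vals - mu) * y))

    t_obs = stat(x)
    # Resample X from its known conditional law, holding Y and Z fixed.
    t_null = np.array([stat(rng.normal(mu, cond_sd)) for _ in range(n_resamples)])
    # The +1 terms make the p-value valid in finite samples.
    return float((1 + np.sum(t_null >= t_obs)) / (n_resamples + 1))

# Hypothetical toy data: X depends on Z through a known linear model.
rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=(n, 3))
beta = np.array([1.0, -1.0, 0.5])
x = z @ beta + rng.normal(size=n)
y_dep = 2.0 * x + rng.normal(size=n)          # Y depends on X given Z
y_null = z.sum(axis=1) + rng.normal(size=n)   # Y independent of X given Z

p_dep = crt_pvalue(x, y_dep, z, lambda zz: zz @ beta, 1.0)
p_null = crt_pvalue(x, y_null, z, lambda zz: zz @ beta, 1.0)
```

Because only X is resampled while Y and Z stay fixed, the test is valid conditionally on Y and Z, which matches the conditioning argument in the abstract.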
Covariance estimation using conjugate gradient for 3D classification in Cryo-EM
Classifying structural variability in noisy projections of biological
macromolecules is a central problem in Cryo-EM. In this work, we build on a
previous method for estimating the covariance matrix of the three-dimensional
structure present in the molecules being imaged. Our proposed method allows for
incorporation of contrast transfer function and non-uniform distribution of
viewing angles, making it more suitable for real-world data. We evaluate its
performance on a synthetic dataset and an experimental dataset obtained by
imaging a 70S ribosome complex.
Reconciling model-X and doubly robust approaches to conditional independence testing
Model-X approaches to testing conditional independence between a predictor
and an outcome variable given a vector of covariates usually assume exact
knowledge of the conditional distribution of the predictor given the
covariates. Nevertheless, model-X methodologies are often deployed with this
conditional distribution learned in sample. We investigate the consequences of
this choice through the lens of the distilled conditional randomization test
(dCRT). We find that Type-I error control is still possible, but only if the
mean of the outcome variable given the covariates is estimated well enough.
This demonstrates that the dCRT is doubly robust, and motivates a comparison to
the generalized covariance measure (GCM) test, another doubly robust
conditional independence test. We prove that these two tests are asymptotically
equivalent, and show that the GCM test is in fact optimal against (generalized)
partially linear alternatives by leveraging semiparametric efficiency theory.
In an extensive simulation study, we compare the dCRT to the GCM test. We find
that the GCM test and the dCRT are quite similar in terms of both Type-I error
and power, and that post-lasso based test statistics (as compared to lasso
based statistics) can dramatically improve Type-I error control for both
methods.
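For orientation, the generalized covariance measure (GCM) test mentioned above can be sketched as follows: regress X on Z and Y on Z, multiply the residuals, and compare the normalized mean of the products to a standard normal. The sketch below uses ordinary least squares for both regressions, which is an assumption made for brevity (the abstract discusses lasso and post-lasso fits); the function name gcm_test and the toy data are hypothetical.

```python
import math
import numpy as np

def gcm_test(x, y, z):
    """GCM statistic and two-sided normal p-value (illustrative sketch)."""
    n = len(x)
    design = np.column_stack([np.ones(n), z])  # intercept plus covariates
    # Residuals from OLS fits of X on Z and of Y on Z.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    prod = res_x * res_y
    # Normalized mean of the residual products.
    t = math.sqrt(n) * prod.mean() / prod.std()
    # Two-sided p-value under the N(0, 1) reference distribution.
    p = math.erfc(abs(t) / math.sqrt(2))
    return float(t), p

# Hypothetical toy data with a partially linear dependence of Y on X.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 3))
x = z @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)
y = z @ np.array([0.5, 0.5, 0.5]) + 1.0 * x + rng.normal(size=n)

t_stat, p_value = gcm_test(x, y, z)
```

The statistic depends on both regressions only through their residuals, which is the source of the double robustness discussed in the abstract.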
Large-scale simultaneous inference under dependence
Simultaneous, post-hoc inference is desirable in large-scale hypotheses
testing as it allows for exploration of data while deciding on criteria for
proclaiming discoveries. It was recently proved that all admissible post-hoc
inference methods for the number of true discoveries must be based on closed
testing. In this paper we investigate tractable and efficient closed testing
with local tests of different properties, such as monotonicity, symmetry and
separability, meaning that the test thresholds a monotonic or symmetric
function or a function of sums of test scores for the individual hypotheses.
This class includes well-known global null tests by Fisher, Stouffer and
Rüschendorf, as well as newly proposed ones based on harmonic means and Cauchy
combinations. Under monotonicity, we propose a new linear time statistic
("coma") that quantifies the cost of multiplicity adjustments. If the tests are
also symmetric and separable, we develop several fast (mostly linear-time)
algorithms for post-hoc inference, making closed testing tractable. Paired with
recent advances in global null tests based on generalized means, our work
immediately instantiates a series of simultaneous inference methods that can
handle many complex dependence structures and signal compositions. We provide
guidance on choosing from these methods via theoretical investigation of the
conservativeness and sensitivity for different local tests, as well as
simulations that find analogous behavior for local tests and full closed
testing. One result of independent interest is the following: if
$P_1, \dots, P_d$ are $p$-values from a multivariate Gaussian with arbitrary
covariance, then their arithmetic average $P$ satisfies $\Pr(P \le t) \le t$ for
$t \le \frac{1}{2d}$.
Comment: 40 pages
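To make the global null tests named in this abstract concrete, here is a minimal sketch of two of them: Fisher's combination, which is calibrated for independent p-values, and the Cauchy combination, which is designed to tolerate dependence. The function names are hypothetical, and the sketch covers only the global tests, not the closed testing procedure or the "coma" statistic proposed in the paper.

```python
import math

def fisher_combination(pvalues):
    """Fisher's combined p-value, assuming independent p-values.

    The statistic -2 * sum(log p_i) is chi-square with 2d degrees of
    freedom under the global null; for even degrees of freedom its tail
    probability has the closed form used below.
    """
    d = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # statistic divided by 2
    tail = sum(half_stat**k / math.factorial(k) for k in range(d))
    return math.exp(-half_stat) * tail

def cauchy_combination(pvalues):
    """Cauchy combination p-value with equal weights."""
    # Map each p-value to a Cauchy quantile and average the results.
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvalues) / len(pvalues)
    # Tail probability of a standard Cauchy distribution at t.
    return 0.5 - math.atan(t) / math.pi

pvals = [0.01, 0.02, 0.03]
p_fisher = fisher_combination(pvals)
p_cauchy = cauchy_combination(pvals)
```

With a single p-value both functions return that p-value unchanged, which is a quick sanity check on the calibration.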
Exponential family measurement error models for single-cell CRISPR screens
CRISPR genome engineering and single-cell RNA sequencing have transformed
biological discovery. Single-cell CRISPR screens unite these two technologies,
linking genetic perturbations in individual cells to changes in gene expression
and illuminating regulatory networks underlying diseases. Despite their
promise, single-cell CRISPR screens present substantial statistical challenges.
We demonstrate through theoretical and real data analyses that a standard
method for estimation and inference in single-cell CRISPR screens --
"thresholded regression" -- exhibits attenuation bias and a bias-variance
tradeoff as a function of an intrinsic, challenging-to-select tuning parameter.
To overcome these difficulties, we introduce GLM-EIV ("GLM-based
errors-in-variables"), a new method for single-cell CRISPR screen analysis.
GLM-EIV extends the classical errors-in-variables model to responses and noisy
predictors that are exponential family-distributed and potentially impacted by
the same set of confounding variables. We develop a computational
infrastructure to deploy GLM-EIV across tens or hundreds of nodes on clouds
(e.g., Microsoft Azure) and high-performance clusters. Leveraging this
infrastructure, we apply GLM-EIV to analyze two recent, large-scale,
single-cell CRISPR screen datasets, demonstrating improved performance in
challenging problem settings.
Comment: 95 pages (35 pages main text)